Client Report - Can You Predict That?

Course DS 250

Author

Dallin Moak

Show the code
import os

_ = os.getcwd()

Source

Source material comes from p4_source.py

Elevator pitch

Simple comparisons of average living area and sale price show that homes built before 1980 are only slightly smaller and slightly cheaper, so a Gaussian naive Bayes classifier tops out around 68% accuracy. Logistic regression performs far better, reaching about 86% on the raw features and about 87% once the features are standardized. The scaled logistic regression model is my current recommendation to the client, though it still falls short of the 90% accuracy target.

QUESTION|TASK 1

Create 2-3 charts that evaluate potential relationships between the home variables and before1980. Explain what you learn from the charts that could help a machine learning algorithm.

I checked for a simple relationship between before1980 and average livearea: the older homes are slightly smaller on average, but the difference is small. I also compared average sprice, and the older homes are slightly cheaper on average. Because neither variable separates the classes well on its own, a very basic naive Bayes model that leans on these single-variable relationships may not perform well.

Show the code
from p4_source import livearea_chart, sprice_chart

livearea_chart
Show the code
sprice_chart
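
As a minimal sketch of the group-mean comparison behind these charts, here is a tiny hypothetical slice of data (the column names livearea, sprice, and before1980 match the real dataset, but the values are invented for illustration):

```python
import pandas as pd

# Hypothetical miniature of the dwellings data; the values are made
# up, only the column names mirror the real dataset.
homes = pd.DataFrame({
    "livearea": [900, 1100, 1500, 1800, 1300, 2100],
    "sprice": [150_000, 160_000, 220_000, 260_000, 190_000, 300_000],
    "before1980": [1, 1, 1, 0, 0, 0],
})

# Per-class averages: these are the quantities the two charts compare.
summary = homes.groupby("before1980")[["livearea", "sprice"]].mean()
print(summary)
```

From here, summary.plot.bar() (or any plotting library) would reproduce the shape of the comparison charts.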

QUESTION|TASK 2

Build a classification model labeling houses as being built “before 1980” or “during or after 1980”. Your goal is to reach or exceed 90% accuracy. Explain your final model choice (algorithm, tuning parameters, etc) and describe what other models you tried.

The obvious starting point is the GaussianNB model. Following standard practice, I split the dataset into training and test sets, then separated features from labels: the label is the before1980 column, and the features are all the other columns except before1980 and yrbuilt (keeping yrbuilt would hand the model the answer). After some basic reading about the GaussianNB parameters priors and var_smoothing, I decided to leave priors at its default so the class balance is learned from the training data rather than forced to an assumed 50-50 split. For var_smoothing, there does not seem to be any accuracy gained by lowering it below the default; I would only increase it if the model ran into numerical errors from features with very similar values.

Here is the GaussianNB model I built:

Show the code
from p4_source import gaussianNB_score

f"GaussianNB accuracy: {gaussianNB_score}"
'GaussianNB accuracy: 0.6793193717277487'
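
As a rough, self-contained illustration of the pipeline described above (the real score comes from p4_source.py), here is a minimal sketch using synthetic stand-in data; the column names livearea, sprice, yrbuilt, and before1980 mirror the real dataset, but the values are made up:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
n = 400
before1980 = rng.integers(0, 2, n)

# Synthetic stand-in for the dwellings data: older homes are drawn
# from slightly shifted distributions so the model has something to learn.
homes = pd.DataFrame({
    "livearea": rng.normal(1400, 300, n) - 100 * before1980,
    "sprice": rng.normal(220_000, 40_000, n) - 20_000 * before1980,
    "yrbuilt": np.where(before1980 == 1, 1960, 1995),
    "before1980": before1980,
})

# Labels are before1980; the features exclude before1980 and yrbuilt,
# since keeping yrbuilt would hand the model the answer.
X = homes.drop(columns=["before1980", "yrbuilt"])
y = homes["before1980"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Default priors (class balance learned from the training data) and
# default var_smoothing, as discussed above.
model = GaussianNB()
model.fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(f"GaussianNB accuracy on synthetic data: {acc:.3f}")
```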

Next, I asked my brother, a data scientist, which model he recommends, and he pointed me to logistic regression. This model is more involved to use than the naive Bayes ones. Run with default settings it scored around 80%; raising the iteration limit to 1000 brought it up to about 85% accuracy.

Show the code
from p4_source import logistic_score
f"Logistic Regression (unscaled) accuracy: {logistic_score}"
'Logistic Regression (unscaled) accuracy: 0.856020942408377'
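
To illustrate the max_iter tweak, here is a hedged sketch on the same kind of synthetic stand-in data (the real score above comes from p4_source.py):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, n)

# Two made-up features with class-dependent shifts, standing in for
# the real dwelling columns; note their very different scales.
X = np.column_stack([
    rng.normal(1400, 300, n) - 100 * y,
    rng.normal(220_000, 40_000, n) - 20_000 * y,
])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# max_iter raised from the default 100 to 1000, as described above.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print(f"unscaled logistic accuracy on synthetic data: {acc:.3f}")
```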

scikit-learn raised a ConvergenceWarning, so I scaled the features with StandardScaler, which brought the accuracy up to about 87%:

Show the code
from p4_source import logistic_score_scaled
f"Logistic Regression (scaled) accuracy: {logistic_score_scaled}"
'Logistic Regression (scaled) accuracy: 0.8739092495636999'
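
The scaling fix can be sketched the same way: wrapping StandardScaler and LogisticRegression in a pipeline standardizes every feature before fitting, which is what quiets the ConvergenceWarning (synthetic data again; the real score comes from p4_source.py):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, n)
# Made-up features with very different scales, as in the real data.
X = np.column_stack([
    rng.normal(1400, 300, n) - 100 * y,
    rng.normal(220_000, 40_000, n) - 20_000 * y,
])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# StandardScaler puts both features on a comparable range, so the
# solver converges cleanly instead of warning.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)
acc = accuracy_score(y_test, pipe.predict(X_test))
print(f"scaled logistic accuracy on synthetic data: {acc:.3f}")
```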

The accuracy scores indicate that the logistic regression model with scaled data is more effective than the GaussianNB model.

QUESTION|TASK 3

Justify your classification model by discussing the most important features selected by your model. This discussion should include a feature importance chart and a description of the features.

type your results and analysis here

QUESTION|TASK 4

Describe the quality of your classification model using 2-3 different evaluation metrics. You also need to explain how to interpret each of the evaluation metrics you use.

type your results and analysis here


STRETCH QUESTION|TASK 1

Repeat the classification model using 3 different algorithms. Display their Feature Importance, and Decision Matrix. Explain the differences between the models and which one you would recommend to the Client.

type your results and analysis here

STRETCH QUESTION|TASK 2

Join the dwellings_neighborhoods_ml.csv data to the dwelling_ml.csv on the parcel column to create a new dataset. Duplicate the code for the stretch question above and update it to use this data. Explain the differences and whether this changes the model you recommend to the Client.

type your results and analysis here

STRETCH QUESTION|TASK 3

Can you build a model that predicts the year a house was built? Explain the model and the evaluation metrics you would use to determine if the model is good.

type your results and analysis here